Handprint recognition test deck

ABSTRACT

A system and method for creating a test deck to qualify and test forms processing systems, including preparing a handprint snippet data base containing labeled handprint image snippets representing a unique hand, preparing a form description file and a data content file, selecting handprint snippets from the handprint snippet data base to formulate a form using the data content file, creating a form image using the selected snippets according to the form description file and printing the form image.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates generally to forms processing technology and more particularly to a system and method to create test materials that can evaluate and help improve systems that recognize handprinted fields.

2. Background Art

Forms processing technology can encompass many types of systems and many steps, including capturing hand-printed data from questionnaires and putting the data into a computer. Many organizations doing forms processing use traditional “heads-down” keying from paper (KFP) and “heads-up” keying from image (KFI).

Traditional “heads-down” keying from paper (KFP) is an approach that involves human keyers sitting in front of a computer terminal, and looking down at a form placed on a rack. They read the data placed on the form by the respondent, and manually key this data into the computer using a KFP software package. People are not very consistent when performing routine tasks for long periods of time, and this process is prone to human error. A major source of error with KFP is placing the data in the wrong field.

“Heads-up” keying from image (KFI) is an approach that uses an electronic scanner to scan the forms before sending the electronic image of the form to a computer screen, along with fields where the human keyer is to key in data. The name comes from the fact that the keyer is looking straight ahead at the screen at all times. This method tends to be more accurate since it greatly reduces the incorrect field problem mentioned above for KFP. It is also often faster than KFP. Unfortunately, it still involves humans, who are a constant source of errors.

A major problem with capturing handprinted data from forms filled out by human respondents is measuring the accuracy and efficiency of the total system. Former testing methods, such as using a handprint font, may be satisfactory for production readiness tests, but clearly are not adequate to claim to measure handprint recognition accuracy and efficiency. This is partly because they are considered “too neat” and would thus give an artificially high estimate of accuracy relative to the “real world.”

Creating a test deck manually gives realistic variability but is a time-consuming and demanding task, and it yields only one unique deck. Creating a corresponding TRUTH file to evaluate the accuracy of the system's processing adds significantly to the complexity of this task. A truth file is an accurate representation of the handprinted data on the forms. To be able to produce a deck of handprinted forms and the corresponding TRUTH file in a more timely and efficient manner would provide a valuable quality assurance tool.

Typical data collection forms request hand-printed responses (rather than cursive) when clarity is required. However, there is great variability in the handprints prepared by the population in general. In order to perform adequate system testing and shakedown, a sufficient number of examples are required so that realistic variability can be characterized for use during testing and system evaluation.

Automation and cost savings can be realized by incorporating handprint Optical Character Recognition (OCR), so that computer systems attempt to recognize the handprinted fields, and send low confidence fields to KFI. The method of this patent creates test materials that make it possible to assess the accuracy of an entire form processing system more easily and consistently, as well as measure efficiency, regardless of whether KFP, KFI, OCR, or all of these are used in any combination. This invention simplifies the expensive and laborious process associated with the (hand made) test forms wherein the forms are keyed twice from paper, and discrepancies verified and corrected by a third person.

BRIEF SUMMARY OF THE INVENTION

Our invention is a method to measure handprint recognition accuracy and efficiency by creating a test deck to qualify and test handprint recognition systems. This includes preparing a handprint snippet database containing labeled handprint image snippets that, collectively or in part, may approximate an actual respondent's handprinted characters. The method includes creating a page layout file, preparing a form description file as well as a data content file, and then selecting handprint snippets from the handprint snippet database to populate a form, creating a “completed” form image using the selected snippets according to the form description file. When printed, these forms may constitute a Digital Test Deck™.

BRIEF DESCRIPTION OF THE DRAWING(S)

FIG. 1 schematically illustrates the test deck creation method.

FIG. 2 is an example of a blank form.

FIG. 3 is an example of a filled out form.

FIG. 4 is a form populated with actual handprint snippets.

FIG. 5 shows a form processing system.

FIG. 6 shows a testing concept at the field level.

FIG. 7 shows a diagram of the mathematical system for measuring error.

FIG. 8 shows an example plot of error results for this system as a function of the number of samples.

DETAILED DESCRIPTION OF THE INVENTION

A digitally created printout 10 of this invention is created using a test deck system 12 as shown in FIG. 1. The printout 10 is sometimes referred to as the Digital Test Deck™ (DTD™), or simply the test deck 10, and can be used in a handprint recognition system to test the system for accuracy and to improve the efficiency of the forms processing technology. When accompanied by a TRUTH file 14, which accurately represents the data printed on the form and defines the expected output of the forms processing system, the test deck system 12 produces a very realistic and accurate set of testing forms. The DTD™ printout 10 can be used for production readiness and for baseline testing, for accuracy and efficiency, of forms processing systems that handle forms such as shipping labels, bank checks and surveys.

The test deck system 12 for creating the DTD™ printout 10 includes a file such as an Adobe PDF or PostScript file that includes instructions to describe an appropriate form 15. It also includes a page description file (PD file) 16 formed by page description software 18. The PD file 16 may be an Adobe Quark Express file. The PD file 16 contains the instructions necessary to determine what form to use and how that form is to be filled out. The form may be a “blank” form 20 (see FIG. 2) that can be directed by the PD file 16 to be sent to a printer 22, which could be a traditional offset printer, a color digital printer, or another type of printer, and which then prints the page images of the forms in duplex or simplex, depending on the requirement.

This test deck system 12 also includes a form description file (FDF) 24 that describes the characteristics of the form's data fields. The FDF 24 contains information such as field length and whether the field is a check box or a write-in field, for marks or characters respectively. Field length can be measured in the number of marks or characters. The test deck system also includes a variable data database (VDDB) 26 that describes the desired content of the simulated respondent entries. The VDDB 26 is an ASCII database that tells what goes where on the printout. In the case where the field is a check box field, the VDDB 26 describes which check boxes are checked and which are blank and the type of mark that is expected. When the field is a write-in field, the VDDB 26 contains a simulated response, such as a last name, “SMITH.” The VDDB 26 allows customization depending on the form 20 used and the job requirements.
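The relationship between the FDF 24 and the VDDB 26 can be illustrated with a minimal sketch. The record layout, the field names, and the `validate` helper below are hypothetical and are not taken from the specification; they only show how a field description (kind and length) could constrain a simulated response.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class FieldSpec:
    """One hypothetical FDF entry: the kind of field and its length."""
    name: str
    kind: str    # "write_in" (characters) or "check_box" (marks)
    length: int  # number of characters or check boxes


# Hypothetical FDF for a two-field form.
FDF: List[FieldSpec] = [
    FieldSpec("last_name", "write_in", 12),
    FieldSpec("sex", "check_box", 2),
]

# Hypothetical VDDB record for one simulated respondent:
# a write-in response and the checked/blank state of each box.
VDDB_RECORD: Dict[str, Union[str, List[bool]]] = {
    "last_name": "SMITH",
    "sex": [True, False],
}


def validate(record: Dict[str, Union[str, List[bool]]],
             fdf: List[FieldSpec]) -> List[str]:
    """Return a list of problems; an empty list means the record fits."""
    problems: List[str] = []
    for spec in fdf:
        value = record.get(spec.name)
        if value is None:
            problems.append(f"{spec.name}: missing")
        elif spec.kind == "write_in":
            if not isinstance(value, str):
                problems.append(f"{spec.name}: expected text")
            elif len(value) > spec.length:
                problems.append(f"{spec.name}: longer than {spec.length}")
        elif spec.kind == "check_box":
            if not isinstance(value, list) or len(value) != spec.length:
                problems.append(f"{spec.name}: expected {spec.length} boxes")
    return problems
```

Checking each VDDB record against the FDF before any snippet is selected would catch a response that cannot fit its field.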

The test deck system 12 shown in FIG. 1 uses Handprint Character Snippets (HPC) 28 organized in an HPC snippet database 30. The HPC snippet database 30 can include characters, letters, symbols, or parts thereof, where a single character is defined by the set of legal symbols for the particular user or job requirement. The HPC snippet database 30 has an appropriate identification/numbering scheme for subsequent incorporation into the test deck system 12. The individual HPC snippet 28 is preferably an image clip containing a hand printed symbol (character, mark or punctuation mark) obtained from a real person. By using the digitally created printout 10 with HPC snippets 28, as described below, it is possible to test the accuracy of form processing by an Optical Character Recognition (OCR) method and/or system with the assurance the results are realistic.

Optical character recognition is often referred to as OCR or ICR (Intelligent Character Recognition). Optical character recognition refers to the process of automatically recognizing write-in fields from the scanned image of the form. Originally, OCR referred to the recognition of machine print, and it is used by government institutions, but it has also been used as a generic name for recognition of handprint as well. Sometimes, the type of OCR used for handprint is called “Handprint OCR.” Optical Mark Recognition (OMR) is a related process and refers to the process of automatically recognizing the answers to check-box questions from the scanned image of the form. It is a more advanced version of its predecessor, known as “mark sense,” which determined if a particular circle or other shape was completely filled in by the respondent.

The VDDB 26 contains acceptable responses for each data field on the form, effectively simulating the data that would be obtained from a real respondent that had completed the form, but with the advantages for accuracy and efficiency mentioned above. The VDDB 26 can be created from a dictionary of acceptable data for each field. The VDDB 26 is capable of specifying the placement of letters, such as upper case or lower case, to more closely simulate actual respondent data and/or changing the placement of the handprint character snippet in reference to a boundary by one or more of the handprint character snippet's position, angle, or size. These letters can extend below the line as if written by hand. These letters include those that have extensions such as g, j, p, q, and y, and can be composed from one or more HPC snippets 28 as will be discussed in detail below.

The test deck system 12 shown in FIG. 1 incorporates a HPC snippet database 30 of characters, symbols and/or digital images. The HPC snippet database 30 classifies the characters and/or digital images by similarity and/or a feature set. The HPC snippet database 30 contains all the valid or verified character images which have been collected to date in the project. These can come from one or more different sources and be mixed together using a computer. One or more of the sources can include computer-generated samples. The HPC snippet database 30 works in conjunction with Field Dictionaries 31 and/or, alternatively, with subfield Dictionaries, which can be ASCII files. This dictionary 31 contains a table of valid entries for any given field on the blank form 20. When there is a complex linkage between fields in a form, such as between the preparer's gender and first name, the dictionary can be subdivided into subfield dictionaries. This method creates the test deck used to qualify and test handprint recognition systems by preparing a handprint snippet database containing labeled handprint image snippets; preparing a form description file and page description file to describe a form; preparing a variable data database that describes the desired content of the simulated respondent entries using the handprint character snippets; populating the form using the variable data database in conjunction with the form description file, the page description file, and the handprint snippet database; and printing the completed form.
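The linkage between fields handled by subfield dictionaries can be sketched as follows. The dictionary contents, the key names, and the `draw_record` function are illustrative assumptions; the specification states only that a dictionary holds valid entries per field and can be subdivided when fields such as gender and first name are linked.

```python
import random
from typing import Dict, List

# Hypothetical field dictionary: valid entries for an unlinked field.
FIELD_DICTIONARY: Dict[str, List[str]] = {
    "last_name": ["SMITH", "JONES", "GARCIA"],
}

# Hypothetical subfield dictionaries: first names keyed by the linked
# field (the check-box answer for gender).
FIRST_NAMES_BY_GENDER: Dict[str, List[str]] = {
    "F": ["MARY", "ANN"],
    "M": ["JOSE", "CHAD"],
}


def draw_record(rng: random.Random) -> Dict[str, str]:
    """Draw one simulated respondent whose linked fields agree."""
    gender = rng.choice(sorted(FIRST_NAMES_BY_GENDER))
    return {
        "gender": gender,
        # The first name is drawn only from the matching subfield dictionary.
        "first_name": rng.choice(FIRST_NAMES_BY_GENDER[gender]),
        "last_name": rng.choice(FIELD_DICTIONARY["last_name"]),
    }
```

Passing a seeded `random.Random` makes the drawn records reproducible, which matches the requirement that the test deck be reproducible as required.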

The test deck system uses a “Hand” 32 defined as an “adequate” supply of characters or symbols, preferably from the HPC snippets database 30, to create the set of handprint field snippets 33 which satisfy the field data requirements for the VDDB file 26. The process is controlled by the variable data snippet maker (VDSM) software 34. The hand 32 can be a Homogeneous Hand, which is a Hand in which all the characters have similar dimensional characteristics and features (i.e., slant, line width, height, etc.), so that the set of handprint field snippets 33 created from this homogeneous hand 32 look as though they were filled in by one individual. VDSM software 34 describes the logic that controls handprint field snippet generation, including the production of the set of handprint field snippets 33 that are to be put on the form image 36 of the appropriate form 15. The VDSM software 34 uses the data field information found in the VDDB file 26 and the field description found in the FDF 24 to appropriately select HPC snippets 28 and electronically paste the set of handprint field snippets 33 together. The handprint field snippets 33 are subsequently electronically pasted onto the digital test deck file 38, using variable data printing software 40 to vary or raster images as is known by those in the art. It is sometimes advantageous to incorporate a random or, alternatively, a defined noise into the data to simulate certain environmental or expected effects.
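The selection and pasting step can be sketched in a few lines. This is not the VDSM software 34 itself; it is a simplified illustration that assumes each HPC snippet is a small bitmap (rows of 0/1 pixels), that a Hand is a mapping from character to its available snippets, and that characters are bottom-aligned with a fixed gap. The names `Bitmap` and `compose_field` are hypothetical.

```python
import random
from typing import Dict, List

Bitmap = List[List[int]]  # rows of 0/1 pixels, all rows the same width


def compose_field(text: str, hand: Dict[str, List[Bitmap]],
                  rng: random.Random, gap: int = 1) -> Bitmap:
    """Paste one snippet per character, left to right, into a field snippet."""
    if not text:
        return []
    # Pick one snippet for each character from the same Hand.
    glyphs = [rng.choice(hand[ch]) for ch in text]
    height = max(len(g) for g in glyphs)
    rows: Bitmap = [[] for _ in range(height)]
    for i, glyph in enumerate(glyphs):
        width = len(glyph[0])
        pad = height - len(glyph)  # blank rows above a shorter glyph
        for y in range(height):
            source = glyph[y - pad] if y >= pad else [0] * width
            rows[y].extend(source)
            if i < len(glyphs) - 1:
                rows[y].extend([0] * gap)  # blank columns between characters
    return rows
```

Because every character is drawn from one Hand, the resulting field snippet looks as though a single person wrote it; drawing from a different Hand per form gives the variability between respondents.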

At the same time the set of handprint field snippets 33 are used to create the digital test deck 38, the well-defined TRUTH file 14 is being constructed to contain the “answer” expected from the system using the form supplied. The HPC snippets 28 are digitally pasted together to construct the set of handprint field snippets 33 described above. This set of handprint field snippets 33 will be subsequently printed on the printout 10. The HPC snippet database 30 can be used in conjunction with the VDSM software 34, and data field information as well as the field descriptions, to create snippets that are placed on the fringe of acceptable limits, such as a snippet touching a boundary or another character in at least one point. The HPC snippets 28 can also be cast into a vector representation if necessary, or placed in reference to a boundary by one or more of the HPC snippet's position, angle, or size.
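Placement on the fringe of acceptable limits can be sketched as a set of offsets inside a rectangular field. The function below is an assumption-laden illustration: it treats the field as a pixel rectangle, treats "touching" as the snippet's edge lying on the field's edge, and returns one placement that is centered plus four that touch one boundary each. The name `fringe_placements` is hypothetical.

```python
from typing import Dict, Tuple


def fringe_placements(box_w: int, box_h: int,
                      snip_w: int, snip_h: int) -> Dict[str, Tuple[int, int]]:
    """Return (x, y) offsets of the snippet's top-left corner in the field."""
    if snip_w > box_w or snip_h > box_h:
        raise ValueError("snippet does not fit inside the field")
    center_x = (box_w - snip_w) // 2
    center_y = (box_h - snip_h) // 2
    return {
        "centered": (center_x, center_y),       # touches nothing if there is slack
        "left": (0, center_y),                  # left edges coincide
        "right": (box_w - snip_w, center_y),    # right edges coincide
        "top": (center_x, 0),                   # top edges coincide
        "bottom": (center_x, box_h - snip_h),   # bottom edges coincide
    }
```

Printing the same response at each of these offsets gives a recognition system the same characters under progressively harder segmentation conditions.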

A form image 42, the page description file 16, and the handprinted field snippets 28 are processed through variable data printing software (VDP) 40 to create the Digital Test Deck™ (DTD™) file 38. The DTD™ file 38 is sent to a printer 22, which may be offset or digital, to produce the DTD™ printout 10. This printout and the DTD™ TRUTH file 14 comprise the system output that is available to the customer for test purposes.
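Because the content of every field is chosen before printing, the TRUTH file 14 can be written from the same records that drive the deck. The sketch below assumes a simple comma-separated layout with one row per field; the specification does not define the TRUTH file format, so the column names and the `write_truth` helper are hypothetical.

```python
import csv
import io
from typing import Dict, Iterable, TextIO, Tuple


def write_truth(forms: Iterable[Tuple[str, Dict[str, str]]],
                out: TextIO) -> None:
    """Write one row per field: form identifier, field name, expected value."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["form_id", "field", "truth"])
    for form_id, fields in forms:
        for name, value in fields.items():
            writer.writerow([form_id, name, value])


# Usage: the same records that populate the forms also produce the truth.
buffer = io.StringIO()
write_truth([("F0001", {"last_name": "SMITH", "first_name": "JOSE"})], buffer)
```

No keying or verification step is needed to obtain the truth, which is the source of the cost advantage over hand-made test forms.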

Digital Test Deck™

The DTD™ 10, containing simulated human handprint, looks like a form prepared by a real person even though it is printed by a digital printer and contains perfectly known data. Using these decks, a forms processing system may be tested for accuracy and efficiency, regardless of the technology used for the data capture. More specifically, the state of the processing system may be reliably assessed at any point in time. A well-defined TRUTH FILE 14 is also developed by the handprint recognition system using extracted data from the forms to accurately represent the data placed within the DTD™. The truth file 14 would contain the “answer” that is expected from a handprint recognition system, and can be used to determine the accuracy of said system.

FIG. 2 shows a blank form, which is a portion of the Year 2000 Decennial Census “short” form. This is an example of a form that would benefit from the described system and method. There are places on this (blank) form 20, called fields 44, where the respondent is asked to print the answers to questions posed by the Census Bureau. When the respondent completes these fields, the form might look like FIG. 3 displaying fictitious data 46 in fields 44. Most people would say this was an actual Census form image, albeit a rather neat one, but it was actually created using a handprint font on a computer. This is one example of a digital test form. A suitable number of different digital test forms would constitute the test deck.

The basic properties of the test deck as defined above are:

-   looks and feels like a real form with handprinted responses, but really printed on a high quality digital color (or black & white) printer;
-   form content designed to test critical system aspects;
-   reproducible as required;
-   complements, but does not replace, forms with “real data” content;
-   consistent test input;
-   test the data capture system “end-to-end;” and
-   know the “truth” perfectly.

The fourth bullet indicates that using handprint fonts results in a rather excessively “neat” simulated form, and so the form is of limited use in actually measuring OCR data capture quality per se. However, using actual handprint “snippets” 28 in the creation of the test deck 10 gives a very realistic appearance, as shown in FIG. 4 (again using fictitious data).

The test deck 10 is used to:

-   verify correct operation of critical system components;
-   establish a measurable system performance baseline;
-   test system operation at each software/hardware change;
-   test daily production operational readiness before scanning;
-   test consistency of system between scan operations, sites and over time; and
-   verify system “improvements.”

This invention enables “outside-in” testing. If a perfectly known input is inserted into the system, and (mostly) the correct answers come out, then it is unlikely that there is anything seriously wrong in between. Alternatively, “inside-out” testing analyzes all possible internal variables, such as a measure of the lamp intensity on the left-hand side of the scanner, or the speed of the scanner transport, etc. The problem with an “inside-out” approach is that it may literally fill up file cabinets with data, and it will be the element or factor that is not tested that causes the system to fail or create erroneous data. The “outside-in” approach used in this invention is advantageous because testing is simple, cost-effective, accurate, and consistent.

How to Measure Accept Rate and Accuracy of a Forms Processing System

FIG. 5 shows a typical forms processing system 50 which uses automatic recognition (OCR) to do the bulk of the data capture workload, and KFI for data capture of those rejected fields for which the OCR system is not confident. The terms Accept Rate and Accuracy Rate are used as a measure of the accuracy of the system under test. In automated recognition of hand printed fields, the Accept Rate is the fraction of the fields in which the OCR has high confidence, usually expressed as a percent. These “accepted” fields are the ones on which OCR accuracy is measured. Accepted fields are not sent to keyers except for quality assurance purposes. Of the OCR “accepted” fields, the fraction of the (non-blank) fields that are “correct” is the Accuracy Rate, usually expressed as a percent. Also shown are the steps taken to measure the accept rate and accuracy of the system. Finally, the Error Rate for either OCR or OMR is defined as 100% minus the Accuracy Rate in percent. So for example, if the Accuracy Rate is 99.2%, the Error Rate is 0.8%.

Related to the error rate is the Reject Rate. The Reject Rate for recognition is 100% minus the Accept Rate in percent. So, if the Accept Rate is 80%, the Reject Rate is 20%. Rejected fields are the low confidence fields from the OCR, and are sent to keyers to be keyed because the automatic OCR isn't sure it has the correct answer. For OMR, the Accuracy Rate is that fraction of all the check-box fields that are correct, usually including blanks. Blanks are commonly included in scoring OMR because there is no way for the computer to know if a check-box contains a mark or not without looking at it, and so an empty check-box which is properly identified as such is considered a “correct” output of OMR. Scoring (also called an accuracy rate) includes the calculation of the accuracy of an OCR or keying (data collection) system. A TRUTH File 14, also referred to as the groundtruth or answer file, contains the set of values expected as output from an OCR/OMR system upon extracting the respondent completed information from a form. When the present invention is used, this TRUTH File 14 can be generated as described below.
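The four OCR rates defined above can be computed directly from per-field results. The sketch below assumes each non-blank write-in field is summarized as a pair of flags (accepted by the OCR, correct against the truth); that representation and the name `ocr_rates` are illustrative, not part of the specification.

```python
from typing import Dict, Iterable, Tuple


def ocr_rates(fields: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Compute Accept, Reject, Accuracy and Error Rates, in percent.

    Each pair is (accepted, correct) for one non-blank write-in field.
    Accuracy is measured over accepted fields only, as defined in the text.
    """
    results = list(fields)
    if not results:
        raise ValueError("no fields to score")
    accepted = [correct for was_accepted, correct in results if was_accepted]
    if not accepted:
        raise ValueError("no accepted fields; Accuracy Rate is undefined")
    accept_rate = 100.0 * len(accepted) / len(results)
    accuracy_rate = 100.0 * sum(accepted) / len(accepted)
    return {
        "accept_rate": accept_rate,
        "reject_rate": 100.0 - accept_rate,
        "accuracy_rate": accuracy_rate,
        "error_rate": 100.0 - accuracy_rate,
    }


# Ten fields: eight accepted (seven of them correct), two rejected.
SAMPLE = [(True, True)] * 7 + [(True, False)] + [(False, False)] * 2
```

For `SAMPLE`, the Accept Rate is 80% and the Reject Rate is 20%, matching the example in the text; the Accuracy Rate is 7 of 8 accepted fields.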

A portion of the test deck system 12 is shown schematically in FIG. 6 where all of the fields 44 in all the forms are being tested together. The fields 44 are pulled out one at a time for testing. If the handprint was J-O-S-E and the resultant ASCII was JOSE, that would be a correct field, termed a “hard match”, meaning each and every character is correct in a field. On the other hand, if the handprint was C-H-A-O and the resultant ASCII was CHAD, there would be an error using the hard match criterion.
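The hard match criterion amounts to an exact, whole-field comparison against the truth. A minimal sketch, with hypothetical function names, follows.

```python
from typing import Iterable, Tuple


def hard_match(truth: str, output: str) -> bool:
    """A field is correct only if every character matches the truth."""
    return truth == output


def field_accuracy(pairs: Iterable[Tuple[str, str]]) -> float:
    """Percent of (truth, output) field pairs that are hard matches."""
    scored = [hard_match(truth, output) for truth, output in pairs]
    if not scored:
        raise ValueError("no fields to score")
    return 100.0 * sum(scored) / len(scored)
```

With the two fields from the text, `hard_match("JOSE", "JOSE")` is a match and `hard_match("CHAO", "CHAD")` is an error, even though only one character differs.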

Here, the total error estimate is:

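$\varepsilon_T = \varepsilon_O A_O + \varepsilon_i (1 - A_O) + \varepsilon_t$

where:

-   $\varepsilon_O$ = OCR Error
-   $A_O$ = OCR Accept Rate
-   $\varepsilon_i$ = KFI Error
-   $\varepsilon_t$ = “Truth” Error
-   $\varepsilon_T$ = Total Data Error, and the Estimator is shown with a ˆ over the $\varepsilon_T$.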
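A short numerical sketch of the total error estimate follows. The OCR Error (0.8%) and OCR Accept Rate (80%) reuse the example figures given earlier in the text; the KFI Error of 2% is an assumed value chosen only for illustration, and the “Truth” Error is set to zero on the premise that a digitally created deck knows the truth perfectly. All rates are expressed as fractions, and the function name is hypothetical.

```python
def total_error(ocr_error: float, ocr_accept_rate: float,
                kfi_error: float, truth_error: float) -> float:
    """Total data error: OCR error on accepted fields, KFI error on
    rejected fields, plus the error in the truth itself."""
    for value in (ocr_error, ocr_accept_rate, kfi_error, truth_error):
        if not 0.0 <= value <= 1.0:
            raise ValueError("rates must be fractions between 0 and 1")
    return (ocr_error * ocr_accept_rate
            + kfi_error * (1.0 - ocr_accept_rate)
            + truth_error)


# 0.008 * 0.80 + 0.02 * 0.20 + 0.0
EXAMPLE = total_error(ocr_error=0.008, ocr_accept_rate=0.80,
                      kfi_error=0.02, truth_error=0.0)
```

Under these assumed inputs the total comes to about 1.04%. With a hand-made deck the truth term would not vanish, since keying errors in the truth add directly to the measured total.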

FIG. 7 shows a mathematical representation of the results of applying this test deck system 12.

How to Associate Measurement Error with Number of Samples

If an accuracy in the neighborhood of p=99%, corresponding to an error rate of q=1%, is needed, then the following relationship approximately describes the one-sigma sampling error in the estimate:

$\sigma = \sqrt{\frac{pq}{n}}$

where n is the number of samples. FIG. 8 is a plot of the measurement error as a function of the number of samples.

Using this method, it is possible to determine how many samples are needed to obtain the desired level of quality in estimating the desired system accuracy. FIG. 8 shows that many samples may be needed to test the test deck system 12 properly. This invention makes creating a large number of samples cost-effective compared to previous manual methods.
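The sampling relationship can be evaluated in either direction: the one-sigma error for a given number of samples, or the number of samples for a target error. The sketch below uses the same p and q = 1 − p as the text, expressed as fractions; the function names are hypothetical.

```python
import math


def sampling_sigma(p: float, n: int) -> float:
    """One-sigma sampling error sqrt(p*q/n), with q = 1 - p."""
    if not 0.0 <= p <= 1.0 or n <= 0:
        raise ValueError("need 0 <= p <= 1 and n > 0")
    return math.sqrt(p * (1.0 - p) / n)


def samples_needed(p: float, sigma: float) -> int:
    """Smallest n for which sqrt(p*q/n) does not exceed sigma."""
    if not 0.0 <= p <= 1.0 or sigma <= 0.0:
        raise ValueError("need 0 <= p <= 1 and sigma > 0")
    return math.ceil(p * (1.0 - p) / sigma ** 2)
```

At p = 0.99, measuring accuracy to a one-sigma error of 0.1 percentage points (sigma = 0.001) takes roughly 9,900 fields, and because the error falls only as the square root of n, halving it takes about four times as many.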

Test Materials for Forms Processing Systems

Six basic types of test materials are used to test forms processing systems:

1. Blank forms;
2. Forms hand-marked by volunteers;
3. Real forms filled out by respondents;
4. Images of real forms on CD-ROM;
5. Lithographically printed forms with simulated respondent marks;
6. Digitally printed forms with simulated respondent marks.

Each of these types of test materials has a purpose, and has advantages and disadvantages. By a suitable combination of these materials, tests may be devised to cover all testing needs.

While the invention has been described in connection with various embodiments, it is not intended to limit the scope of the invention to the particular form set forth; on the contrary, it is intended to cover such alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. In particular, the test decks described herein could be comprised of a wide variety of printed forms in addition to questionnaires; for example, bank checks, shipping labels, and other types of printed forms.

1. A method for creating a test deck to qualify and test handprint recognition systems, the method comprising: (a) preparing a handprint snippet database containing labeled handprint image snippets; (b) preparing a form description file and page description file to describe a form; (c) preparing a variable data database that describes the desired content of the simulated respondent entries using the handprint character snippets; (d) populating the form using the variable data database in conjunction with the form description file, the page description file, and the handprint snippet database; and (e) printing the completed form.
 2. The method of claim 1 further comprising where the handprint character snippets are grouped representing a unique Hand.
 3. The method of claim 1 further comprising incorporated noise into the handprint snippet database.
 4. The method of claim 1 further comprising changing the placement of the handprint character snippet in reference to a boundary by one or more of the handprint character snippet's position, angle, or size.
 5. The method of claim 1 further comprising incorporating handprint character snippets from more than one source.
 6. The method of claim 1 further comprising creating variable handprint character snippets on fringe of acceptable limits so that the snippet touches the boundary or another character in at least one point.
 7. The method of claim 1 wherein the handprint snippets are cast into a vector representation.
 8. The method of claim 1 further comprising software to help define the placement of handprint character snippets on the form.
 9. The method of claim 1 further comprising creating the handprint character snippet database by: (a) collecting multiple handprint character snippets sampled from each contribution; and (b) mixing the samples.
 10. The method of claim 9, wherein the samples are computer generated.
 11. The method of claim 9, wherein multiple contributions are used.
 12. A system of creating a test deck from handprint character snippets to qualify and test handprint recognition systems (OCR) comprising: (a) a database of handprint character snippets for use in creating a variable handprint character snippet or field snippet; (b) data relating to the test deck, including a form description file a page description, and a variable data database; (c) variable data printing software receiving the data from a) and b); and (d) printing software for printing one or more variable characters snippets using the variable data database to create a digital test deck.
 13. The system of claim 12 further comprising computing software for controlling the image raster processor while rastering the variable characters or field snippets.
 14. The system of claim 12, further comprising rastering the handprint character snippets to incorporate changes in angle, size, or position.
 15. The system of claim 12, further comprising the variable data database coupled to multiple sources of handprint character snippets.
 16. The system of claim 12, further comprising variable character snippets, including various variable character snippets positioned in relation to four boundaries of a rectangle.
 17. The system of claim 16, further comprising including the variable character data database and handprint character snippets that have been rastered four times resulting in one version not touching a boundary and the other four touching each of four different boundaries.
 18. The system of claim 12 further comprising printing the digital test deck comprising a plurality of forms for use in qualifying and testing an OCR system. 